This document reports my study on the impact of several characteristics on wine quality.
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## [1] "3" "4" "5" "6" "7" "8"
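The last two lines of output above show quality stored as an ordered factor. The code that produced them is not shown in this report, so the call below is only a minimal sketch under that assumption. The ten values are the first ten quality scores from the `str()` output above; the levels are passed explicitly because this small sample does not contain every score from 3 to 8, and the name `quality_f` is made up for the illustration.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumption: an ordered factor with one level per score observed in the data (3 to 8).
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality_f)      # structure of the ordered factor
levels(quality_f)   # the six levels, from lowest to highest
```

Because the factor is ordered, comparisons such as `quality_f >= "6"` are meaningful, which plain factors do not allow.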
More information about the wines can be found in [Cortez et al., 2009]. Input variables (based on physicochemical tests):
fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3)
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)
quality (score between 0 and 10; this is the output variable, based on sensory data)
I plotted all variables except X (which appears to be a unique identifier) to get a feel for the distribution of the values. I also plotted the density curve of each variable. Basic statistics such as the mean and median were presented in the previous section.
pH and density seem to be normally distributed. All the others, except quality, have a long tail.
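One quick numerical companion to the plots is to compare each mean with its median: in a right-skewed distribution the mean is pulled above the median. The sketch below uses only the rounded values printed by `summary()` above and scales the difference by the interquartile range. The column name `mean_shift` is made up here, and this is a crude indicator rather than a normality test or the method used for the plots.

```r
# Quartiles, medians and means copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39, 0.09, 1.9, 0.07, 7, 22, 0.9956, 3.21, 0.55, 9.5),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 0.9967,
             3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64, 0.42, 2.6, 0.09, 21, 62, 0.9978, 3.40, 0.73, 11.1)
)

# (mean - median) / IQR: near 0 for symmetric data, positive for a right tail.
stats$mean_shift <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Largest shifts first.
stats[order(-stats$mean_shift), c("variable", "mean_shift")]
```

On these numbers, pH and density come out close to 0, while residual sugar and chlorides show the largest positive shift. A small value does not rule out a long tail, so the plots remain the primary evidence.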
There are 1599 observations of 13 numeric variables.
The first feature of interest is the wine quality. It ranges from 3 to 8, with a median of 6. The vast majority of the wines (1319 of 1599) have a rating of either 5 or 6.
I decided to create a categorical variable, rating, that classifies wines qualitatively as bad, average, or good. Most wines are rated as average.
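The report does not state which quality scores map to which rating, so the cut points in the sketch below (4 or lower is bad, 5 to 6 is average, 7 or higher is good) are an assumption chosen for illustration. The quality vector is rebuilt from the frequency table printed earlier.

```r
# Rebuild the quality scores from the frequency table above
# (10 wines rated 3, 53 rated 4, 681 rated 5, and so on).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed cut points (not given in the report):
# (0, 4] -> bad, (4, 6] -> average, (6, 10] -> good.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

table(rating)
```

With these assumed cut points, the table counts 63 bad, 1319 average, and 217 good wines, which agrees with the observation that most wines are average.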
Second, I investigated citric acid because I noticed that many observations in the dataset equal 0 (132 observations, to be precise). In my opinion, this requires further investigation to find out whether these values were properly reported.
## [1] 132
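A count like the one above can be obtained by summing a logical comparison. The sketch below is an illustration on the first ten citric.acid values from the `str()` output, not the code used in the report; the names `n_zero` and `share_zero` are made up.

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero <- sum(citric_acid == 0)       # number of exact zeros in the sample
share_zero <- mean(citric_acid == 0)  # fraction of zeros in the sample

# Share of zeros in the full dataset, from the count reported above.
full_share <- 132 / 1599
round(full_share, 3)
```

For the full dataset, 132 of 1599 observations is roughly 8% of the wines.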
Other variables of interest are pH and density, as they seem to be normally distributed. Alcohol and total/free sulfur dioxide appear to have a long right tail, with most values concentrated at the low end.
I think it could be of interest to investigate whether pH could be classified into categories (acid, base, neutral). The relationship between fixed.acidity, volatile.acidity, and citric.acid (the latter in particular, due to the large number of 0s) could also be worth investigating in order to understand whether the values are properly reported. This would require understanding the theoretical relationship between these variables.
The only variable I created is rating, an ordered factor that classifies wines as good, bad, or average. Given the information the other variables store, none of them seems to lend itself to a sub-classification into categories, perhaps with the exception of pH (neutral, base, acid).
I’ve not found the need to tidy, adjust, or change the form of the data.
I’ve discussed the distributions in the section about the features of interest.
Below I use boxplots to gain a better understanding of the outliers (X is excluded because it is an index, and quality because it is categorical). I also plotted the data on a log10 scale, shown below the boxplots (I did so for all variables, although the log10 scale is only interesting for those with long tails). These plots show that fixed acidity and, to some extent, pH, chlorides, density, sulphates, and volatile acidity follow a normal distribution. For the acidity variables, apart from citric acid, this is in line with the fact that pH seems to be normally distributed. The exception might be explained by the number of 0s (potentially non-responses) discussed earlier. pH is normally distributed, which suggests that the data is good, since pH is by definition a measure of acidity and is on a logarithmic scale.
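To make the notion of an outlier concrete, the sketch below applies the usual 1.5 × IQR rule to residual sugar, using only the rounded quartiles and extremes from the `summary()` output. R's boxplots use hinges and a default whisker coefficient of 1.5, so the fences computed here may differ slightly from the whiskers in the plots; treat the numbers as an approximation.

```r
# Residual sugar, copied from the summary() output above.
q1 <- 1.9
q3 <- 2.6
min_value <- 0.9
max_value <- 15.5

iqr <- q3 - q1                  # interquartile range, 0.7
lower_fence <- q1 - 1.5 * iqr   # about 0.85
upper_fence <- q3 + 1.5 * iqr   # about 3.65

c(lower_fence = lower_fence, upper_fence = upper_fence)
min_value < lower_fence   # is the minimum a low outlier?
max_value > upper_fence   # is the maximum a high outlier?
```

The minimum stays inside the lower fence, while the maximum of 15.5 lies far above the upper fence, so the outliers of this variable are all on the high side.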
I start the bivariate analysis by plotting the ggpairs correlation matrix.
The box plots below show, from another perspective, the correlation between the variables and quality. I also used jittered scatter plots to get another perspective on the distribution of the values. Colors correspond to the qualitative rating.
I observed that the following characteristics yield ‘good’ wines:
It looks like the following yield ‘average’/‘good’ wines:
Residual sugar and chlorides did not seem to have an impact on the quality or rating of the wines.
I studied the relationship between the variables that correlate the most with quality: citric acid, sulphates, alcohol, and volatile acidity (see next question).
Apart from the observation that the higher the alcohol, the better the wine, Pearson’s test revealed that volatile acidity and citric acid have a fairly strong negative correlation (-0.55).
##
## Pearson's product-moment correlation
##
## data: citric.acid and alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06121189 0.15807276
## sample estimates:
## cor
## 0.1099032
##
## Pearson's product-moment correlation
##
## data: sulphates and alcohol
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04477906 0.14196454
## sample estimates:
## cor
## 0.09359475
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
##
## Pearson's product-moment correlation
##
## data: sulphates and citric.acid
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
##
## Pearson's product-moment correlation
##
## data: sulphates and volatile.acidity
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3060917 -0.2147125
## sample estimates:
## cor
## -0.2609867
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
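Each test above reports a t statistic that follows from the sample correlation r and the degrees of freedom via t = r · sqrt(df) / sqrt(1 − r²). The sketch below recomputes two of the reported statistics from the printed correlations as a sanity check; `t_from_r` is a helper name made up for this illustration.

```r
# t statistic of a Pearson correlation test, given r and the degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# citric.acid and alcohol: the output above reports t = 4.4188.
t_ca <- t_from_r(0.1099032, 1597)

# volatile.acidity and citric.acid: the output above reports t = -26.489.
t_vc <- t_from_r(-0.5524957, 1597)

# Two-sided p-value for the first pair; the output above reports 1.059e-05.
p_ca <- 2 * pt(-abs(t_ca), df = 1597)

round(c(t_ca = t_ca, t_vc = t_vc), 3)
p_ca
```

With 1597 degrees of freedom, even a correlation as small as 0.09 is statistically significant, so the size of r is more informative here than the p-value.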
Quality shows the strongest correlation with alcohol (0.4761663). There is also a notable negative correlation with volatile acidity (-0.3905578). Among the pairs of features tested above, sulphates and alcohol show the weakest bivariate relationship (0.09359475).
## [1] "Correlation with fixed acidity"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
## [1] "Correlation with volatile acidity"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
## [1] "Correlation with citric acid"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
## [1] "Correlation with residual sugar"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
## [1] "Correlation with chlorides"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
## [1] "Correlation with free sulfur dioxide"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
## [1] "Correlation with total sulfur dioxide"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
## [1] "Correlation with density"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
## [1] "Correlation with pH"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
## [1] "Correlation with sulphates"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## [1] "Correlation with alcohol"
##
## Pearson's product-moment correlation
##
## data: v and as.numeric(quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
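The tests above correlate each variable with `as.numeric(quality)`. When quality is an ordered factor, `as.numeric()` returns the level codes (1 to 6) rather than the scores (3 to 8); since Pearson's correlation is unchanged by shifting a variable, the results are the same either way. The sketch below illustrates this on the first ten rows from the `str()` output. The helper `cor_with` is hypothetical and is not the code used in the report.

```r
# First ten rows of three columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: Pearson correlation of every other column with the target.
cor_with <- function(df, target) {
  y <- as.numeric(df[[target]])
  predictors <- setdiff(names(df), target)
  vapply(predictors, function(v) cor(df[[v]], y), numeric(1))
}

r_numeric <- cor_with(wine, "quality")

# Same data with quality as an ordered factor: as.numeric() now yields
# level codes, which are the scores shifted by a constant.
wine_f <- wine
wine_f$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)
r_factor <- cor_with(wine_f, "quality")

all.equal(r_numeric, r_factor)
```

The shift argument only holds when the levels are consecutive scores, as they are here; a factor with gaps between its levels would distort the spacing.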
Below are scatter plots showing the correlation between variables. I plot the variables that showed a strong correlation with quality, using rating as the ordered color scheme.
I focused on the 4 features that showed the strongest correlation with quality, plus pH. The plots show that higher citric acid, higher sulphates, higher alcohol, and lower volatile acidity are key factors in achieving a high-quality wine.
pH does not seem to have a high impact on wine quality, despite the fact that the acids do play a role.
I created a model to predict the quality of a wine from its alcohol content, because alcohol turned out to have the strongest correlation with quality. The linear model did not work well in practice. I think the main reason is that quality is a categorical (ordinal) value, and hence not well suited to linear models. It might be possible to predict pH from the acids.
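The model code is not shown in this report, so the sketch below assumes a simple linear regression of quality on alcohol alone. For such a model, R² equals the squared Pearson correlation, which gives a quick way to gauge the fit on the full dataset from the correlation reported earlier. The sample values are the first ten rows from the `str()` output.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed model form: one predictor plus an intercept.
fit <- lm(quality ~ alcohol)

r2_model <- summary(fit)$r.squared        # R-squared reported by lm
r2_from_cor <- cor(alcohol, quality)^2    # squared Pearson correlation
all.equal(r2_model, r2_from_cor)          # the two agree for this model form

# Full dataset: square the reported correlation between alcohol and quality.
r2_full <- 0.4761663^2
round(r2_full, 3)
```

Under this assumption, alcohol alone accounts for only about 23% of the variance in quality on the full dataset, which is consistent with the model not working well in practice.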
In the plot, I am using rating as the ordered color scheme.
These plots show the effect of acidity and pH on wine quality. I observe that higher acidity (hence, lower pH), except for volatile acidity, tends to yield better wines. That higher acidity goes together with lower pH makes sense because these variables are correlated. It remains to be understood why volatile acidity does not follow the same pattern as the other acids, which would be in line with lower pH.
The impact of fixed acidity on quality is observed to be marginal.
These plots show the effect of alcohol on wine quality. On average, wine quality increases with alcohol, except for wines of quality 5. To further understand the differences between quality 4 and 5, I computed summary statistics for the wines with each of those ratings. I couldn’t conclude much, but I believe that the conclusion that wine quality increases with alcohol is correct and that this is just an artifact of this dataset (and, obviously, of the subjectivity that rating wines entails).
## [1] "Alcohol summaries for quality = 4"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## [1] "Alcohol summaries for quality = 5"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
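Per-group summaries like the two above can be computed for every quality level in one call with `tapply()`. The sketch below is an illustration on the first ten rows from the `str()` output, not the code used in the report, so its numbers describe only that small sample.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Full six-number summary of alcohol for each quality level in the sample.
by_quality <- tapply(alcohol, quality, summary)
by_quality[["5"]]

# Median alcohol per quality level, convenient for comparing groups.
medians <- tapply(alcohol, quality, median)
medians
```

In the full dataset, the summaries above give a median alcohol of 10.00 for quality 4 against 9.7 for quality 5, which is the dip discussed in the text.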
As for sulphates, I observed that wine quality increases with sulphates. I did not observe much sulphate variability for good wines, meaning either that sulphates are important up to a certain level or that they reached their maximum value (knowing more about wines would help here!).
This boxplot shows the relationship between quality and residual sugar vs. alcohol. I observe that higher alcohol levels and lower residual sugar levels lead to better red wines.
I tried to add more information to the plot (such as medians), but the plot became difficult to read. It did, however, increase my confidence in the conclusion I outlined before.
The conclusion is also in line with what is known about red wines: they tend to be less sweet than white wines.
The study of the wines has revealed the following:
Most wines in the dataset are rated as ‘average’; only a small number of wines are rated as ‘bad’.
There might be a problem with the citric acid, as there are many entries set to 0.
Higher alcohol positively impacts the rating of the wines (i.e., better wines).
More alcohol and lower residual sugar yield good red wines. [My intuition tells me that this might be different for white wines!]
Residual sugar and chlorides did not seem to have an impact on the quality or rating of the wines.
The lower the volatile acidity, the better the wine (strong negative correlation).
Given that only alcohol showed a notable correlation with the quality of wine, this may suggest that wine ratings are not fully objective, even when given by experts. It may also be the case that other factors are not represented in the dataset (vintage, harvest year, location of the vineyards, temperatures before the harvest, etc.). Trying to obtain this information would be interesting for further exploration. Confirming these results using the white wine dataset would also be of particular interest to me.
My major struggle in the course of this project was to find more information about wines that would be important to this study, as I am not a wine connoisseur. I spent a fair amount of time looking online for more information about wines. Being a wine lover, but not a connoisseur, I think this project helped increase my knowledge about wines. I think the next time I drink wine I will recall this project and check whether my rating of the wine matches what I have discovered.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.